-
Notifications
You must be signed in to change notification settings - Fork 311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] cuGraph GNN NCCL-only Setup and Distributed Sampling #4278
[FEA] cuGraph GNN NCCL-only Setup and Distributed Sampling #4278
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems alright to me. @seunghwak is a bit more familiar with some of the sampling output functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me in general, I have some minor suggestions and questions.
src = cudf.Series(np.array_split(edgelist[0], world_size)[rank]) | ||
dst = cudf.Series(np.array_split(edgelist[1], world_size)[rank]) | ||
|
||
seeds = cudf.Series(np.arange(rank * 50, (rank + 1) * 50)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just minor nitpicking suggestions.
What is 50? # seeds per rank? If this is an example,
num_seeds_per_rank = 50
seeds = cudf.Series(np.arange(rank * num_seeds_per_rank, (rank + 1) * num_seeds_per_rank))
will be more readable.
And just for completeness. Are we assuming that # ranks * # seeds per rank > # vertices in the input graph? If this code assumes anything about the input edgelist
, we may specify that in comments or we may give a check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you mean < # vertices? But this is really just meant to be a toy example. It's using a dataset that has far more than that number of vertices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I changed the variable to seeds_per_rank
like you suggested to make it clearer what I'm doing.
src = cudf.Series(edgelist[0]) | ||
dst = cudf.Series(edgelist[1]) | ||
|
||
seeds = cudf.Series(np.arange(0, 50)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. What happens if edgelist
is empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a toy example, it's not meant to be robust, just to accept the known good input we give it (in this case the ogbn-products dataset).
rank, world_size, nccl_comms, n_streams_per_handle=0, verbose=False | ||
): | ||
handle = Handle(n_streams=n_streams_per_handle) | ||
inject_comms_on_handle_coll_only(handle, nccl_comms, world_size, rank, verbose) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, we don't need p2p?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. As far as I can know we can run all the GNN algorithms without UCX p2p comms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we initialized p2p
comms in our benchmarks .
without UCX p2p comms.
Are we configuring UCX here? I thought this was for NCCL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're only using nccl here. No UCX. I think that's sufficient for what we're doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least that's what @ChuckHastings told me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, yeah, so p2p here means UCX p2p. NCCL has P2P as well. I think the naming here is confusing. NCCL-only might be more appropriate.
prows = int(math.sqrt(ngpus)) | ||
while ngpus % prows != 0: | ||
prows = prows - 1 | ||
return prows, int(ngpus / prows) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just FYI, maybe in the future, we may use a common C++ utility function to compute a desirable 2D partition (with an user option to override the default). Setting prows as close as possible to math.sqrt(ngpus) is just one possibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied this from the existing dask config. I would be happy to replace this with a better function if that becomes available. We can make an issue for it.
edge_id_array_p = ( | ||
minibatch_dict["edge_id"][start_ix:end_ix] | ||
if has_edge_ids | ||
else cupy.array([], dtype="int64") | ||
) | ||
edge_type_array_p = ( | ||
minibatch_dict["edge_type"][start_ix:end_ix] | ||
if has_edge_types | ||
else cupy.array([], dtype="int32") | ||
) | ||
weight_array_p = ( | ||
minibatch_dict["weight"][start_ix:end_ix] | ||
if has_weights | ||
else cupy.array([], dtype="float32") | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, are we assuming that edge IDs are always int64, edge types are always int32 (this may make sense), and edge weights are always float32?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think C++ only supports int32 edge types. For edge id, the frameworks always use int64. In any case, this does actually allow for int32 edge ids, we just have to set a default empty dtype, and I just went with int64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for edge weights, I'm not technically restricting them to float32 here.
raise NotImplementedError("Must be implemented by subclass") | ||
|
||
def sample_from_nodes( | ||
self, nodes: TensorType, *, batch_size: int = 16, random_state: int = 62 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is random state a seed for random number generation? If yes, make sure that you are providing different seeds for different ranks in random number generation. If not and for deterministic random number generators, you are generating same random number sequences in every GPU.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In C++, we take base_seed and use base_seed + rank in random number generator initialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're doing that too. It's taken care of here:
random_state=random_state + rank, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the new tests and examples.
Co-authored-by: Rick Ratzel <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like CI got further, but more failures that might mean we have to skip tests?
Yeah, we're skipping for now but I think long-term we may migrate some or all of this code to WholeGraph. But that's a discussion for another time. |
/merge |
pylibcugraph
without UCX/dask within PyTorch DDP.Closes #4200
Closes #4201
Closes #4246
Closes #3851